Abstract

We consider model-based reinforcement learning in finite Markov De- 
cision Processes (MDPs), focussing on so-called optimistic strategies. In 
MDPs, optimism can be implemented by carrying out extended value it- 
erations under a constraint of consistency with the estimated model tran- 
sition probabilities. The UCRL2 algorithm by Auer, Jaksch and Ortner 
(2009), which follows this strategy, has recently been shown to guarantee 
near-optimal regret bounds. In this paper, we strongly argue in favor of 
using the Kullback-Leibler (KL) divergence for this purpose. By studying 
the linear maximization problem under KL constraints, we provide an ef- 
ficient algorithm, termed KL-UCRL, for solving KL-optimistic extended 
value iteration. Using recent deviation bounds on the KL divergence, we 
prove that KL-UCRL provides the same guarantees as UCRL2 in terms of 
regret. However, numerical experiments on classical benchmarks show a 
significantly improved behavior, particularly when the MDP has reduced 
connectivity. To support this observation, we provide elements of com- 
parison between the two algorithms based on geometric considerations. 
Keywords : Reinforcement learning; Markov decision processes; Model- 
based approaches; Optimism; Kullback-Leibler divergence; Regret bounds 



1 Introduction 

In reinforcement learning, an agent interacts with an unknown environment, 
aiming to maximize its long-term payoff [inj . This interaction is modelled by a 
Markov Decision Process (MDP) and it is assumed that the agent does not know 
the parameters of the process and needs to learn directly from observations. The 
agent thus faces a fundamental trade-off between gathering experimental data 
about the consequences of the actions (exploration) and acting consistently with 
past experience in order to maximize the rewards (exploitation).

We consider in this article an MDP with finite state and action spaces for which
we propose a model-based reinforcement learning algorithm, i.e., an algorithm
that maintains running estimates of the model parameters (transition
probabilities and expected rewards) [SI UHl [IH US]. A well-known approach to balance
exploration and exploitation, followed for example by the well-known algorithm
R-MAX [4], is the so-called optimism in the face of uncertainty principle. It
was first proposed in the multi-armed bandit context by [11], and has been
extended since then to several frameworks: instead of acting optimally according
to the estimated model, the agent follows the optimal policy for a surrogate
model, named optimistic model, which is close enough to the former but leads
to a higher long-term reward. The performance of such an algorithm can be
analyzed in terms of regret, which consists in comparing the rewards collected
by the algorithm with the rewards obtained when following an optimal policy.
The study of the asymptotic regret due to [11] in the multi-armed context has
been extended to MDPs by [5], proving that an optimistic algorithm can achieve
logarithmic regret. The subsequent works [H [HI [3] introduced algorithms that
guarantee non-asymptotic logarithmic regret in a large class of MDPs. In these
latter works, the optimistic model is computed using the $L^1$ (or total variation)
norm as a measure of proximity between the estimated and optimistic transition
probabilities.

In addition to logarithmic regret bounds, the UCRL2 algorithm of [3] is also 
attractive due to the simplicity of each extended value iteration step. In this 
case, optimism simply results in adding a bonus to the most promising transition 
(i.e., the transition that leads to the state with current highest value) while 
removing the corresponding probability mass from less promising transitions. 
This process is both elementary and easily interpretable, which is desirable in 
some applications. 

However, the extended value iteration leads to undesirable pitfalls, which 
may compromise the practical performance of the algorithm. First, the opti- 
mistic model is not continuous with respect to the estimated parameters - small 
changes in the estimates may result in very different optimistic models. More 
importantly, the optimistic model can become incompatible with the obser- 
vations by assigning a probability of zero to a transition that has actually been 
observed. Moreover, in MDPs with reduced connectivity, optimism results 
in a persistent bonus for all transitions heading towards the most valuable state, 
even when significant evidence has been accumulated that these transitions are 
impossible. 

In this paper, we propose an improved optimistic algorithm, called KL-UCRL,
that avoids these pitfalls altogether. The key is the use of the
Kullback-Leibler (KL) pseudo-distance instead of the $L^1$ metric, as in [5].
Indeed, the smoothness of the KL pseudo-distance largely alleviates the first
issue. The second issue is completely avoided thanks to the strong relationship
between the geometry of the probability simplex induced by the KL
pseudo-distance and the theory of large deviations. For the third issue, we show
that the KL-optimistic model results from a trade-off between the relative value
of the most promising state and the statistical evidence accumulated so far
regarding its reachability.

We provide an efficient procedure, based on one-dimensional line searches,
to solve the linear maximization problem under KL constraints. As a consequence,
the numerical complexity of the KL-UCRL algorithm is comparable to
that of UCRL2. Building on the analysis of [HI 131 [I], we also obtain logarithmic
regret bounds for the KL-UCRL algorithm. The proof of this result is based on
novel concentration inequalities for the KL-divergence, which have interesting
properties when compared with those traditionally used for the $L^1$ norm.
Although the obtained regret bounds are comparable to earlier results in terms of
rate and dependence on the number of states and actions, we observed in practice
significant performance improvements. This observation is illustrated using
benchmark examples (the RiverSwim and SixArms environments of [14]) and
through a thorough discussion of the geometric properties of KL neighborhoods.

The paper is organized as follows. The model and a brief survey of the
value iteration algorithm in undiscounted MDPs are presented in Section 2.
Sections 3 and 4 are devoted, respectively, to the description and the analysis
of the KL-UCRL algorithm. Section 5 contains numerical experiments and
Section 6 concludes the paper by discussing the advantages of using KL rather
than $L^1$ confidence neighborhoods.

2 Markov Decision Process 

Consider a Markov decision process (MDP) $M = (\mathcal{X}, \mathcal{A}, P, r)$ with finite state
space $\mathcal{X}$ and action space $\mathcal{A}$. Let $X_t \in \mathcal{X}$ and $A_t \in \mathcal{A}$ denote respectively
the state of the system and the action chosen by the agent at time $t$. The
probability to jump from state $X_t$ to state $X_{t+1}$ is denoted by $P(X_{t+1}; X_t, A_t)$.
Besides, the agent receives at time $t$ a random reward $R_t \in [0, 1]$ with mean
$r(X_t, A_t)$. The aim of the agent is to choose the sequence of actions so as to
maximize the cumulated reward. His choices are summarized in a stationary
policy $\pi : \mathcal{X} \to \mathcal{A}$.

In this paper, we consider communicating MDPs, i.e., MDPs such that for
any pair of states $x, x'$, there exists a policy under which $x'$ can be reached from
$x$ with positive probability. For those MDPs, it is known that the average reward
following a stationary policy $\pi$, denoted by $\rho^\pi(M)$ and defined as

$$\rho^\pi(M) = \lim_{n \to \infty} \frac{1}{n} E_{M,\pi}\left( \sum_{t=0}^{n} R_t \right),$$

is state-independent [13]. Let $\pi^*(M) : \mathcal{X} \to \mathcal{A}$ and $\rho^*(M)$ denote respectively
the optimal policy and the optimal average reward: $\rho^*(M) = \sup_\pi \rho^\pi(M) =
\rho^{\pi^*(M)}(M)$. The notations $\rho^*(M)$ and $\pi^*(M)$ are meant to highlight the fact
that both the optimal average reward and the optimal policy depend on the
model $M$. The optimal average reward satisfies the so-called Bellman optimality
equation: for all $x \in \mathcal{X}$,

$$h^*(M, x) + \rho^*(M) = \max_{a \in \mathcal{A}} \left( r(x, a) + \sum_{x' \in \mathcal{X}} P(x'; x, a)\, h^*(M, x') \right),$$



where the $|\mathcal{X}|$-dimensional vector $h^*(M)$ is called a bias vector. Note that it is
only defined up to an additive constant. For a fixed MDP $M$, the optimal policy
$\pi^*(M)$ can be derived by solving the optimality equation and by defining, for
all $x \in \mathcal{X}$,

$$\pi^*(M, x) \in \arg\max_{a \in \mathcal{A}} \left( r(x, a) + \sum_{x' \in \mathcal{X}} P(x'; x, a)\, h^*(M, x') \right).$$

In practice, the optimal average reward and the optimal policy may be computed,
for instance, using the value iteration algorithm [13].



3 The KL-UCRL algorithm 

In this paper, we focus on the reinforcement learning problem in which the
agent does not know the model $M$ beforehand, i.e. the transition probabilities
and the distribution of the rewards are unknown. More specifically, we consider
model-based reinforcement learning algorithms which estimate the model
through observations and act accordingly. Denote by $\hat{P}_t(x'; x, a)$ the estimate
at time $t$ of the transition probability from state $x$ to state $x'$ conditionally to
the action $a$, and by $\hat{r}_t(x, a)$ the mean reward received in state $x$ when action
$a$ has been chosen. We have:

$$\hat{P}_t(x'; x, a) = \frac{N_t(x, a, x')}{\max(N_t(x, a), 1)}, \qquad
\hat{r}_t(x, a) = \frac{\sum_{k=0}^{t-1} R_k \mathbb{1}\{X_k = x, A_k = a\}}{\max(N_t(x, a), 1)}, \tag{1}$$

where $N_t(x, a, x') = \sum_{k=0}^{t-1} \mathbb{1}\{X_k = x, A_k = a, X_{k+1} = x'\}$ is the number of visits, up to
time $t$, to the state $x$ followed by a visit to $x'$ when the action $a$ has been
chosen, and similarly, $N_t(x, a) = \sum_{k=0}^{t-1} \mathbb{1}\{X_k = x, A_k = a\}$. The optimal policy in
the estimated model $\hat{M}_t = (\mathcal{X}, \mathcal{A}, \hat{P}_t, \hat{r}_t)$ may be misleading due to estimation
errors: pure exploitation policies are commonly known to fail with positive
probability. To avoid this problem, optimistic model-based approaches consider
a set $\mathcal{M}_t$ of potential MDPs including $\hat{M}_t$ and choose the MDP from this set
that leads to the largest average reward. In the following, the set $\mathcal{M}_t$ is defined
as follows:

$$\mathcal{M}_t = \Big\{ M = (\mathcal{X}, \mathcal{A}, P, r) : \forall x \in \mathcal{X}, \forall a \in \mathcal{A},\;
|\hat{r}_t(x, a) - r(x, a)| \le \epsilon_R(x, a, t)
\text{ and } d(\hat{P}_t(.; x, a), P(.; x, a)) \le \epsilon_P(x, a, t) \Big\},$$

where $d$ measures the difference between the transition probabilities. The radii
of the neighborhoods $\epsilon_R(x, a, t)$ and $\epsilon_P(x, a, t)$ around, respectively, the
estimated reward $\hat{r}_t(x, a)$ and the estimated transition probabilities $\hat{P}_t(.; x, a)$,
decrease with $N_t(x, a)$.
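As an illustration of the empirical estimates described above (visit counts, with the denominator floored at 1 so that unvisited pairs are well defined), here is a minimal Python sketch. The function name `estimate_model` and the representation of the history as `(x, a, reward, x_next)` tuples are assumptions made for this example, not part of the original algorithm description.

```python
def estimate_model(transitions, n_states, n_actions):
    """Empirical transition and reward estimates from observed transitions.

    `transitions` is an iterable of (x, a, reward, x_next) tuples
    (a hypothetical data layout chosen for this sketch).
    """
    counts = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    reward_sums = [[0.0] * n_actions for _ in range(n_states)]
    for x, a, reward, x_next in transitions:
        counts[x][a][x_next] += 1
        reward_sums[x][a] += reward

    p_hat = [[None] * n_actions for _ in range(n_states)]
    r_hat = [[0.0] * n_actions for _ in range(n_states)]
    visits = [[0] * n_actions for _ in range(n_states)]
    for x in range(n_states):
        for a in range(n_actions):
            n = sum(counts[x][a])          # number of visits to (x, a)
            visits[x][a] = n
            denom = max(n, 1)              # floor at 1 for unvisited pairs
            p_hat[x][a] = [c / denom for c in counts[x][a]]
            r_hat[x][a] = reward_sums[x][a] / denom
    return p_hat, r_hat, visits
```

Note that for an unvisited pair the estimated transition vector is identically zero, so it is not a probability vector until at least one visit has been recorded.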

In contrast to UCRL2, which uses the $L^1$-distance for $d$, we propose to
rely on the Kullback-Leibler divergence, as in the seminal article [5]; however,
contrary to the approach of [5], no prior knowledge on the state structure of the
MDP is needed. Recall that the Kullback-Leibler divergence is defined for all
$n$-dimensional probability vectors $p$ and $q$ by $KL(p, q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}$ (with
the convention that $0 \log 0 = 0$). In the sequel, we will show that this choice
dramatically alters the behavior of the algorithm and leads to significantly better
performance, while causing a limited increase of complexity; in Section 5 the
advantages of using a KL-divergence instead of the $L^1$-norm are illustrated and
discussed.
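The definition of the KL divergence, including the $0 \log 0 = 0$ convention, can be sketched in a few lines of Python. The function name `kl_divergence` is chosen for this example only; returning infinity when $q_i = 0 < p_i$ is the usual extension of the definition and is stated here as an assumption.

```python
import math


def kl_divergence(p, q):
    """KL(p, q) = sum_i p_i * log(p_i / q_i), with the convention 0 log 0 = 0."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue                 # 0 * log(0 / q_i) is taken to be 0
        if q_i == 0.0:
            return math.inf          # p puts mass where q puts none
        total += p_i * math.log(p_i / q_i)
    return total
```

The asymmetry is the point of interest here: a candidate model $q$ that gives zero probability to a transition observed under $p$ is at infinite divergence from $p$, whereas a $q$ that gives positive probability to an unobserved transition remains at finite divergence.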

3.1 The KL-UCRL algorithm 

The KL-UCRL algorithm, described below, is a variant of the efficient model-based
algorithm UCRL2, introduced by [1] and extended to more general MDPs
by [3]. The key step of the algorithm, the search for the optimistic model (Step
8), is detailed below as Algorithm 2.



Algorithm 1 KL-UCRL

 1: Initialization: $j = 0$, $t_0 = 0$; $\forall a \in \mathcal{A}, \forall x \in \mathcal{X}$, $n_0(x, a) = 0$, $N_0(x, a) = 0$;
    initial policy $\pi_0$.
 2: for all $t \ge 1$ do
 3:    Observe $X_t$
 4:    if $n_j(X_t, \pi_j(X_t)) \ge \max(N_{t_j}(X_t, \pi_j(X_t)), 1)$ then
 5:       Begin a new episode: $j = j + 1$, $t_j = t$
 6:       Reinitialize: $\forall a \in \mathcal{A}, \forall x \in \mathcal{X}$, $n_j(x, a) = 0$
 7:       Estimate $\hat{P}_t$ and $\hat{r}_t$ according to (1)
 8:       Find the optimistic model $M_j \in \mathcal{M}_t$ and the related policy $\pi_j$ solving
          equation (2) and using Algorithm 2
 9:    end if
10:    Choose action $A_t = \pi_j(X_t)$
11:    Receive reward $R_t$
12:    Update the count within the current episode:
          $n_j(X_t, A_t) = n_j(X_t, A_t) + 1$
13:    Update the global count:
          $N_t(X_t, A_t) = N_{t-1}(X_t, A_t) + 1$
14: end for
The KL-UCRL algorithm proceeds in episodes. Let $t_j$ be the starting time
of episode $j$; the length of the $j$-th episode depends on the number of visits
$N_{t_j}(x, a)$ to each state-action pair $(x, a)$ before $t_j$ compared to the number of
visits $n_j(x, a)$ to the same pair during the $j$-th episode. More precisely, an
episode ends as soon as $n_j(x, a) \ge N_{t_j}(x, a)$ for some state-action pair $(x, a)$.
The policy $\pi_j$, followed during the $j$-th episode, is an optimal policy for the
optimistic MDP $M_j = (\mathcal{X}, \mathcal{A}, P_j, r_j) \in \mathcal{M}_{t_j}$, which is computed by solving the
extended optimality equations: for all $x \in \mathcal{X}$



$$h^*(x) + \rho^* = \max_{P, r} \max_{a \in \mathcal{A}} \left( r(x, a) + \sum_{x' \in \mathcal{X}} P(x'; x, a)\, h^*(x') \right) \tag{2}$$

where the maximum is taken over all $P, r$ such that

$$\forall x, \forall a, \quad KL(\hat{P}_{t_j}(.; x, a), P(.; x, a)) \le \frac{C_P}{N_{t_j}(x, a)},$$

$$\forall x, \forall a, \quad |\hat{r}_{t_j}(x, a) - r(x, a)| \le \frac{C_R}{\sqrt{N_{t_j}(x, a)}},$$

where $C_P$ and $C_R$ are constants which control the size of the confidence balls.
The transition matrix $P_j$ and the mean reward $r_j$ of the optimistic MDP $M_j$
maximize those equations. The extended value iteration algorithm may be used
to approximately solve the fixed point equation (2) [131 H].



3.2 Maximization of a linear function on a KL-ball 

At each step of the extended value iteration algorithm, the maximization problem
(2) has to be solved. For every state $x$ and action $a$, the maximization of
$r(x, a)$ under the constraint that $|\hat{r}_{t_j}(x, a) - r(x, a)| \le C_R / \sqrt{N_{t_j}(x, a)}$ is
obviously solved taking $r(x, a) = \hat{r}_{t_j}(x, a) + C_R / \sqrt{N_{t_j}(x, a)}$, so that the main
difficulty lies in maximizing the dot product between the probability vector
$q = P(.; x, a)$ and the value vector $V = h^*$ over a KL-ball around the fixed
probability vector $p = \hat{P}_{t_j}(.; x, a)$:

$$\max_{q \in S^n} V'q \quad \text{s.t.} \quad KL(p, q) \le \epsilon, \tag{3}$$

where $V'$ denotes the transpose of $V$ and $S^n$ the set of $n$-dimensional probability
vectors. The radius of the neighborhood $\epsilon = C_P / N_{t_j}(x, a)$ controls the
size of the confidence ball. This convex maximization problem is studied in
Appendix A, leading to the efficient algorithm presented below. Detailed analysis
of the Lagrangian of (3) shows that the solution of the maximization problem
essentially relies on finding roots of the function $f$ (that depends on the
parameter $V$), defined as follows: for all $\nu > \max_{i \in \bar{Z}} V_i$, with $\bar{Z} = \{i : p_i > 0\}$,

$$f(\nu) = \sum_{i \in \bar{Z}} p_i \log(\nu - V_i) + \log\left( \sum_{i \in \bar{Z}} \frac{p_i}{\nu - V_i} \right). \tag{4}$$

In the special case where the most promising state $i_M$ has never been reached
from the current state-action pair (i.e. $p_{i_M} = 0$), the algorithm makes a trade-off
between the relative value of the most promising state $V_{i_M}$ and the statistical
evidence accumulated so far regarding its reachability.



Algorithm 2 Function MaxKL

Input: A value function $V$, a probability vector $p$, a constant $\epsilon$
Output: A probability vector $q$ that maximizes (3)

 1: Let $Z = \{i : p_i = 0\}$ and $\bar{Z} = \{i : p_i > 0\}$.
 2: Let $I^* = Z \cap \arg\max_i V_i$
 3: if $I^* \neq \emptyset$ and there exists $i \in I^*$ such that $f(V_i) < \epsilon$ then
 4:    Let $\nu = V_i$ and $r = 1 - \exp(f(\nu) - \epsilon)$.
 5:    For all $i \in I^*$, assign values of $q_i$ such that $\sum_{i \in I^*} q_i = r$.
 6:    For all $i \in Z \setminus I^*$, let $q_i = 0$.
 7: else
 8:    For all $i \in Z$, let $q_i = 0$. Let $r = 0$.
 9:    Find $\nu$ such that $f(\nu) = \epsilon$ using Newton's method.
10: end if
11: For all $i \in \bar{Z}$, let $q_i = \frac{(1 - r)\, \tilde{q}_i}{\sum_{k \in \bar{Z}} \tilde{q}_k}$ where $\tilde{q}_i = \frac{p_i}{\nu - V_i}$.



In practice, $f$ being a convex positive decreasing function (see Appendix B),
Newton's method can be applied to find $\nu$ such that $f(\nu) = \epsilon$ (in Step 9 of
the algorithm), so that numerically solving (3) is a matter of a few iterations.
Appendix B contains a discussion of the initialization of Newton's algorithm
based on asymptotic arguments.
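The procedure of this section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the function name `max_kl`, the equal split of the mass $r$ among tied zero-probability states, the starting point of the Newton iteration (a point to the left of the root, from which the iterates increase monotonically because $f$ is convex and decreasing), and the handling of degenerate inputs are all choices made for this example.

```python
import math


def max_kl(V, p, eps, max_iter=200):
    """Sketch of MaxKL: maximise sum_i q[i] * V[i] subject to KL(p, q) <= eps."""
    n = len(p)
    support = [i for i in range(n) if p[i] > 0]       # states with p_i > 0
    if eps <= 0:
        return list(p)                                # degenerate ball: q = p
    v_sup = max(V[i] for i in support)

    def f(nu):                                        # the function f of the text
        s_log = sum(p[i] * math.log(nu - V[i]) for i in support)
        s_inv = sum(p[i] / (nu - V[i]) for i in support)
        return s_log + math.log(s_inv)

    def f_prime(nu):                                  # derivative of f
        ez = sum(p[i] / (nu - V[i]) for i in support)
        ez2 = sum(p[i] / (nu - V[i]) ** 2 for i in support)
        return ez - ez2 / ez

    q = [0.0] * n
    r = 0.0
    v_best = max(V)
    # Zero-probability states attaining the overall maximum of V.
    i_star = [i for i in range(n) if p[i] == 0 and V[i] == v_best]
    if i_star and v_best > v_sup and f(v_best) < eps:
        nu = v_best
        r = 1.0 - math.exp(f(nu) - eps)
        for i in i_star:
            q[i] = r / len(i_star)                    # assumption: equal split
    elif all(V[i] == v_sup for i in support):
        return list(p)                                # V constant on the support
    else:
        gap = 1.0                                     # find a point left of the root
        while f(v_sup + gap) <= eps:
            gap /= 2.0
        nu = v_sup + gap
        for _ in range(max_iter):                     # Newton iterations on f(nu) = eps
            step = (f(nu) - eps) / f_prime(nu)
            nu -= step
            if abs(step) <= 1e-10 * max(1.0, abs(nu)):
                break
    q_tilde = {i: p[i] / (nu - V[i]) for i in support}
    norm = sum(q_tilde.values())
    for i in support:
        q[i] = (1.0 - r) * q_tilde[i] / norm
    return q
```

With $p = (0.3, 0.7, 0)'$ and $V = (1, 2, 3)'$, a large radius such as $\epsilon = 1/2$ puts positive mass on the never-observed third state, while a small radius such as $\epsilon = 1/500$ leaves it at zero.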

4 Regret bounds 
4.1 Theoretical results 

To analyze the performance of KL-UCRL, we compare the rewards accumulated
by the algorithm to the rewards that would be obtained, on average, by an agent
playing an optimal policy. The regret of the algorithm after $T$ steps is defined
as in [9]:

$$\mathrm{Regret}_T = \sum_{t=1}^{T} (\rho^*(M) - R_t).$$

We adapt the regret bound analysis of the UCRL2 algorithm to the use of
KL-neighborhoods, and obtain similar theorems. Let

$$D(M) = \max_{x, x'} \min_{\pi} E_{M,\pi}(\tau(x, x')),$$

where $\tau(x, x')$ is the hitting time of $x'$, starting from state $x$. The $D(M)$ constant
will appear in the regret bounds. For all communicating MDPs $M$, $D(M)$ is
finite. Theorem 1 establishes an upper bound on the regret of the KL-UCRL
algorithm with $C_P$ and $C_R$ defined as



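$$C_P = |\mathcal{X}| \left( B + \log\left( B + \frac{1}{\log(T)} \right) \left[ 1 + \frac{1}{B + \frac{1}{\log(T)}} \right] \right),$$

where $B = \log\left( \frac{2e |\mathcal{X}|^2 |\mathcal{A}| \log(T)}{\delta} \right)$ and

$$C_R = \sqrt{\frac{\log(4 |\mathcal{X}| |\mathcal{A}| \log(T) / \delta)}{1.99}}.$$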



Theorem 1 With probability $1 - \delta$, it holds that for $T > 5$, the regret of
KL-UCRL is bounded by

$$\mathrm{Regret}_T \le C D(M) |\mathcal{X}| \sqrt{|\mathcal{A}| T \log(\log(T) / \delta)},$$

for a constant $C \le 24$ that does not depend on the model.

It is also possible to prove a logarithmic upper bound for the expected regret.
This bound, presented in Theorem 2, depends on the model through the constant
$\Delta(M)$ defined as $\Delta(M) = \rho^*(M) - \max_{\pi : \rho^\pi(M) < \rho^*(M)} \rho^\pi(M)$. $\Delta(M)$ quantifies
the margin between optimal and suboptimal policies.

Theorem 2 For $T > 5$, the expected regret of KL-UCRL is bounded by

$$E(\mathrm{Regret}_T) \le C \frac{D^2(M) |\mathcal{X}|^2 |\mathcal{A}| \log(T)}{\Delta(M)} + C(M),$$

where $C \le 400$ is a constant independent of the model, and $C(M)$ is a constant
which depends on the model (see [9]).

4.2 Elements of proof 

The proof of Theorem 1 is inspired from [9, 3]. Due to the lack of space, we only
provide the main steps of the proof. First, the following proposition enables us
to ensure that, with high probability, the true model $M = (\mathcal{X}, \mathcal{A}, P, r)$ belongs
to the set of models $\mathcal{M}_t$ at each time step.

Proposition 1 For every horizon $T \ge 1$ and for $\delta > 0$, $P(\forall t \le T, M \in \mathcal{M}_t) \ge
1 - 2\delta$.
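The proof relies on the two following concentration inequalities due to O [S]: for
all $x \in \mathcal{X}$, $a \in \mathcal{A}$, any $C_P > 0$, and $C_R > 0$, it holds that

$$P\left( \exists t \le T, \; KL(\hat{P}_t(.; x, a), P(.; x, a)) > \frac{C_P}{N_t(x, a)} \right)
\le 2e (C_P \log(T) + |\mathcal{X}|) e^{-C_P / |\mathcal{X}|} = \frac{\delta}{|\mathcal{X}| |\mathcal{A}|}, \tag{5}$$

$$P\left( \exists t \le T, \; |\hat{r}_t(x, a) - r(x, a)| > \frac{C_R}{\sqrt{N_t(x, a)}} \right)
\le 4 \log(T) e^{-1.99 C_R^2} = \frac{\delta}{|\mathcal{X}| |\mathcal{A}|}.$$

Then, summing over all state-action pairs, Proposition 1 follows.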

Using Hoeffding's inequality, with high probability, the regret at time $T$ can
be written as the sum of a regret in each of the $m(T)$ episodes plus an additional
term $C_e(T, \delta) = \sqrt{T \log(1/\delta) / 2}$:

$$\mathrm{Regret}_T \le \sum_{(x, a)} N_T(x, a) (\rho^*(M) - r(x, a)) + C_e(T, \delta)
= \sum_{k=1}^{m(T)} \sum_{(x, a)} n_k(x, a) (\rho^*(M) - r(x, a)) + C_e(T, \delta).$$

Let $P_k$ and $\pi_k$ denote, respectively, the transition probability matrix of the
optimistic model and the optimal policy in the $k$-th episode ($1 \le k \le m(T)$). It
is easy to show that (see [9] for details), with probability $1 - \delta$,

$$\mathrm{Regret}_T \le \sum_{k=1}^{m(T)} \sum_{x \in \mathcal{X}} n_k(x, \pi_k(x)) \Big[
(r_k(x, \pi_k(x)) - r(x, \pi_k(x)))
+ (P_k(.; x, \pi_k(x)) - P(.; x, \pi_k(x)))' h_k
+ (P(.; x, \pi_k(x)) - e_x)' h_k \Big]
+ C_e(T, \delta),$$

where $h_k$ is a bias vector, $e_x(y) = 1$ if $x = y$ and $e_x(y) = 0$ otherwise. We now
bound each of the three terms in the previous summation. Denote by $n_k^{\pi_k}$ the
row vector such that $n_k^{\pi_k}(x) = n_k(x, \pi_k(x))$ and by $r_k^{\pi_k}$ (resp. $r^{\pi_k}$) the column
vector such that $r_k^{\pi_k}(x) = r_k(x, \pi_k(x))$ (resp. $r^{\pi_k}(x) = r(x, \pi_k(x))$). Similarly
$P_k^{\pi_k}$ (resp. $P^{\pi_k}$) is the transition matrix if the policy $\pi_k$ is followed under the
optimistic model (resp. the true model $M$). If the true model $M \in \mathcal{M}_{t_k}$,
we have

$$n_k^{\pi_k} (r_k^{\pi_k} - r^{\pi_k}) \le 2 \sum_{(x, a)} n_k(x, a) \frac{C_R}{\sqrt{N_{t_k}(x, a)}}. \tag{6}$$
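Using Pinsker's inequality, and the fact that $\|h_k\|_\infty \le D$ [S],

$$n_k^{\pi_k} (P_k^{\pi_k} - P^{\pi_k}) h_k \le \sum_{(x, a)} n_k(x, a) \|P_k(.; x, a) - P(.; x, a)\|_1 \|h_k\|_\infty
\le 2 D \sqrt{2 C_P} \sum_{(x, a)} \frac{n_k(x, a)}{\sqrt{N_{t_k}(x, a)}}. \tag{7}$$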

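The third term $n_k^{\pi_k} (P^{\pi_k} - I) h_k$ may be written as follows:

$$n_k^{\pi_k} (P^{\pi_k} - I) h_k = \sum_{t = t_k}^{t_{k+1} - 1} (P(.; X_t, A_t) - e_{X_t})' h_k,$$

where $e_x$ is the all 0's vector with a 1 only on the $x$-th component. For all
$t \in [t_k, t_{k+1} - 1]$, note that $\xi_t = (P(.; X_t, A_t) - e_{X_{t+1}})' h_k$ is a martingale difference
upper-bounded by $D$. Applying the Azuma-Hoeffding inequality, we obtain that

$$\sum_{k=1}^{m(T)} n_k^{\pi_k} (P^{\pi_k} - I) h_k \le \sum_{t=1}^{T} \xi_t + m(T) D
\le D \sqrt{2 T \log(1/\delta)} + m(T) D \tag{8}$$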



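with probability $1 - \delta$. In addition, Auer et al. [1] proved that

$$\sum_{k=1}^{m(T)} \sum_{x, a} \frac{n_k(x, a)}{\sqrt{N_{t_k}(x, a)}} \le (\sqrt{2} + 1) \sqrt{|\mathcal{X}| |\mathcal{A}| T}$$

and

$$m(T) \le |\mathcal{X}| |\mathcal{A}| \log_2\left( \frac{8T}{|\mathcal{X}| |\mathcal{A}|} \right).$$

Combining all the terms completes the proof of Theorem 1. The proof of
Theorem 2 follows from Theorem 1 using the same arguments as in the proof of
Theorem 4 in [9].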



5 Numerical experiments 

To compare the behavior of algorithms KL-UCRL and UCRL2, we consider
the benchmark environments RiverSwim and SixArms proposed by [14] as well
as a collection of randomly generated sparse environments. The RiverSwim
environment consists of six states. The agent starts from the left side of the
row and, in each state, can either swim left or right. Swimming to the right
(against the current of the river) is successful with probability 0.35; it leaves the
agent in the same state with a high probability equal to 0.6, and leads him to
the left with probability 0.05 (see Figure 1). On the contrary, swimming to the
left (with the current) is always successful. The agent receives a small reward
when he reaches the leftmost state, and a much larger reward when reaching the
rightmost state; the other states offer no reward. This MDP requires efficient
exploration procedures, since the agent, having no prior idea of the rewards,
has to reach the right side to discover which is the most valuable state-action
pair.

Figure 1: RiverSwim Transition Model: the continuous (resp. dotted) arrows
represent the transitions if action 1 (resp. 2) has been chosen.

The SixArms environment consists of seven states, one of which (state
0) is the initial state. From the initial state, the agent may choose one among
six actions: the action $a \in \{1, \ldots, 6\}$ leads to the state $x = a$ with probability
$p_a$ (see Figure 2) and leaves the agent in the initial state with probability $1 - p_a$.
From all the other states, some actions deterministically lead the agent to the
initial state while the others leave it in the current state. Staying in a state
$x \in \{1, \ldots, 6\}$, the agent receives a reward equal to $R_x$ (see Figure 2); otherwise,
no reward is received.

Figure 2: SixArms Transition Model
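The RiverSwim dynamics described in this section can be written down as a small transition table. The sketch below is illustrative only: the function name `riverswim_transitions`, the action indices `LEFT` and `RIGHT`, and the treatment of the two end states (probability mass that would leave the row is kept in place) are assumptions of this example, since the text does not specify the boundary behavior.

```python
def riverswim_transitions(n_states=6, p_right=0.35, p_stay=0.6, p_left=0.05):
    """Transition table P[x][a][x_next] for a RiverSwim-like chain."""
    LEFT, RIGHT = 0, 1   # hypothetical action indices
    P = [[[0.0] * n_states for _ in range(2)] for _ in range(n_states)]
    for x in range(n_states):
        # Swimming left (with the current) always succeeds.
        P[x][LEFT][max(x - 1, 0)] = 1.0
        # Swimming right (against the current) may succeed, stall, or slip back.
        # Assumption: at the two ends, mass that would leave the row stays put.
        P[x][RIGHT][min(x + 1, n_states - 1)] += p_right
        P[x][RIGHT][x] += p_stay
        P[x][RIGHT][max(x - 1, 0)] += p_left
    return P
```

The resulting table has reduced connectivity: from any of the first four states, no action reaches the rightmost state in one step, which is exactly the kind of structure the agent has to learn.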

We compare the performance of the KL-UCRL algorithm to UCRL2 using
20 Monte-Carlo replications. For both algorithms, the constants $C_P$ and $C_R$
are set to ensure that the upper bounds of the regret of Theorem 1 and
Theorem 2 in [9] hold with probability 0.95. In the SixArms environment,
the received rewards being deterministic, we slightly modify both algorithms so
that the agent knows them beforehand. We observe in Figures 3 and 4 that the
KL-UCRL algorithm achieves a smaller average regret than the UCRL2
algorithm in those benchmark environments. In both environments, it is crucial
for the agent to learn that there is no action leading from some states to the
most promising one: for example, in the RiverSwim environment, between one
of the first four states and the sixth state.

In addition to those benchmark environments, a generator of sparse environments
has been used to create 10-state, 5-action environments with
random rewards in [0, 1]. In these random environments, each state is connected
with, on average, five other states (with transition probabilities drawn
from a Dirichlet distribution). We reproduced the same experiments as in the
previous environments and display the average regret in Figure 5.



Figure 3: Comparison of the regret of the UCRL2 and KL-UCRL algorithms in
the RiverSwim environment.



Figure 4: Comparison of the regret of the UCRL2 and KL-UCRL algorithms in
the SixArms environment.

Figure 5: Comparison of the regret of the UCRL2 and KL-UCRL algorithms in
randomly generated sparse environments.

6 Discussion

In this section, we expose the advantages of using a confidence ball based on
the Kullback-Leibler divergence rather than an $L^1$-ball, as proposed for instance
in [2, 16], in the computation of the optimistic policy. This discussion aims at
explaining and interpreting the difference of performance that can be observed in
simulations. In KL-UCRL, optimism reduces to maximizing the linear function
$V'q$ over a KL-ball (see (3)), whereas the other algorithms make use of an
$L^1$-ball:

$$\max_{q \in S^n} V'q \quad \text{s.t.} \quad \|p - q\|_1 \le \epsilon'. \tag{9}$$



Continuity 

Consider an estimated transition probability vector $p$, and denote by $q^{KL}$ (resp.
$q^1$) the probability vector which maximizes Equation (3) (resp. Equation (9)).
It is easily seen that $q^{KL}$ and $q^1$ lie respectively on the border of the convex set
$\{q \in S^{|\mathcal{X}|} : KL(p, q) \le \epsilon\}$ and at one of the vertices of the polytope $\{q \in S^{|\mathcal{X}|} :
\|p - q\|_1 \le \epsilon'\}$. A first noteworthy difference between those neighborhoods is
that, due to the smoothness of the KL-neighborhood, $q^{KL}$ is continuous with
respect to the vector $V$, which is not the case for $q^1$.

To illustrate this, Figure 6 displays $L^1$- and KL-balls around 3-dimensional
probability vectors. The set of 3-dimensional probability vectors is represented
by a triangle whose vertices are the vectors $(1, 0, 0)'$, $(0, 1, 0)'$ and $(0, 0, 1)'$, the
probability vector $p$ by a white star, and the vectors $q^{KL}$ and $q^1$ by a white
point. The arrow represents the direction of $V$'s projection on the simplex and
indicates the gradient of the linear function to maximize. The maximizer $q^1$
can vary significantly for small changes of the value function, while $q^{KL}$ varies
continuously.

Unlikely transitions 

Denote $i_m = \arg\min_j V_j$ and $i_M = \arg\max_j V_j$. As underlined by [3], $q^1_{i_m} =
\max(p_{i_m} - \epsilon'/2, 0)$ and $q^1_{i_M} = \min(p_{i_M} + \epsilon'/2, 1)$. This has two consequences:

1. if $p$ is such that $0 < p_{i_m} < \epsilon'/2$, then $q^1_{i_m} = 0$; so the optimistic
model may assign a probability equal to zero to a transition that has actually
been observed, which makes it hardly compatible with the optimism
principle. Indeed, an optimistic MDP should not forbid transitions that
really exist, even if they lead to states with small values;

2. if $p$ is such that $p_{i_M} = 0$, then $q^1_{i_M}$ never equals 0; therefore, an optimistic
algorithm that uses $L^1$-balls will always assign positive probability
to transitions to $i_M$ even if this transition is impossible under the true
MDP and if much evidence has been accumulated against the existence of
such a transition. Thus, the exploration bonus of the optimistic procedure
is wasted, whereas it could be used more efficiently to favor some other
transitions.
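Both consequences can be reproduced numerically with a short sketch of the $L^1$ maximization: add $\epsilon'/2$ to the highest-value state, then remove the same mass from the lowest-value states. The function name `max_l1` is hypothetical, and the greedy removal order is the standard way of solving this linear program, stated here as an assumption about how it is implemented in practice.

```python
def max_l1(V, p, eps):
    """Maximise sum_i q[i] * V[i] over probability vectors q with ||p - q||_1 <= eps."""
    n = len(p)
    q = list(p)
    order = sorted(range(n), key=lambda i: V[i])   # states by increasing value
    best = order[-1]
    q[best] = min(1.0, p[best] + eps / 2.0)        # bonus to the most promising state
    excess = sum(q) - 1.0
    for i in order:                                # remove mass from the worst states first
        if excess <= 0:
            break
        if i == best:
            continue
        removed = min(q[i], excess)
        q[i] -= removed
        excess -= removed
    return q
```

With $p = (0.05, 0.35, 0.6)'$, $V = (-1, 0.05, 0)'$ and $\epsilon' = 0.2$, the observed transition to the first state receives probability zero; with $p = (0.3, 0.7, 0)'$ and $V = (1, 2, 3)'$, the never-observed third state keeps the positive probability $\epsilon'/2$ however small $\epsilon'$ becomes.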

This explains a large part of the experimental advantage of KL-UCRL observed
in the simulations. Indeed, $q^{KL}$ always assigns strictly positive probability to
observed transitions, and eventually renounces unobserved transitions even if
the target states have a potentially large value. Algorithm 2 works as follows:
for all $i$ such that $p_i \neq 0$, $q_i \neq 0$; for all $i$ such that $p_i = 0$, $q_i = 0$ except if
$p_{i_M} = 0$ and if $f(V_{i_M}) < \epsilon$, in which case $q_{i_M} = 1 - \exp(f(V_{i_M}) - \epsilon)$. But this
is no longer the case when $\epsilon$ becomes small enough, that is, when sufficiently
many observations are available. We illustrate those two important differences
in Figure 7 by representing the $L^1$ and KL neighborhoods together with the
maximizers $q^{KL}$ and $q^1$, first if $p_{i_m}$ is positive but very small, and second if $p_{i_M}$ is
equal to 0. Figure 8 also illustrates the latter case, by representing the evolution
of the probability vector $q$ that maximizes both (9) and (3) for an example with
$p = (0.3, 0.7, 0)'$, $V = (1, 2, 3)'$ and $\epsilon$ decreasing from 1/2 to 1/500.

Figure 6: The $L^1$-neighborhood $\{q \in S^3 : \|p - q\|_1 \le 0.2\}$ (left) and
KL-neighborhood $\{q \in S^3 : KL(p, q) \le 0.02\}$ (right) around the probability vector
$p = (0.15, 0.2, 0.65)'$ (white star). The white points are the maximizers of
equations (3) and (9) with $V = (0, 0.05, 1)'$ (up) and $V = (0, -0.05, 1)'$ (down).

A Linear optimization over a KL-ball 

This section explains how to solve the optimization problem of Equation (3).
In [12], a similar problem arises in a different context, and a somewhat different
solution is proposed for the case when the $p_i$ are all positive. As a problem of
maximizing a linear function under convex constraints, it is sufficient to consider
the Lagrangian function

$$L(q, \lambda, \nu, \mu_1, \ldots, \mu_N) = \sum_{i=1}^{N} q_i V_i - \lambda \left( \sum_{i=1}^{N} p_i \log \frac{p_i}{q_i} - \epsilon \right)
- \nu \left( \sum_{i=1}^{N} q_i - 1 \right) + \sum_{i=1}^{N} \mu_i q_i.$$

Figure 7: The $L^1$ (left) and KL-neighborhoods (right) around the probability
vector $p = (0, 0.4, 0.6)'$ (up) and $p = (0.05, 0.35, 0.6)'$ (down). The white point
is the maximizer of the equations (3) and (9) with $V = (-1, -2, -5)'$ (up)
and $V = (-1, 0.05, 0)'$ (down). We took $\epsilon = 0.05$ (up), $\epsilon = 0.02$ (down) and
$\epsilon' = \sqrt{2\epsilon}$.

Figure 8: Evolution of the probability vector $q$ that maximizes both (3) (top)
and (9) (bottom) with $p = (0.3, 0.7, 0)'$, $V = (1, 2, 3)'$ and $\epsilon$ decreasing from 1/2
to 1/500.
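If $q$ is a maximizer, there exist $\lambda \in \mathbb{R}$, $\nu$, $\mu_i \ge 0$ ($i = 1 \ldots N$) such that the
following conditions are simultaneously satisfied:

$$V_i + \lambda \frac{p_i}{q_i} - \nu + \mu_i = 0, \tag{10}$$

$$\lambda \left( \sum_{i=1}^{N} p_i \log \frac{p_i}{q_i} - \epsilon \right) = 0, \tag{11}$$

$$\nu \left( \sum_{i=1}^{N} q_i - 1 \right) = 0, \tag{12}$$

$$\mu_i q_i = 0. \tag{13}$$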

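Let $\bar{Z} = \{i : p_i > 0\}$. Conditions (10) to (13) imply that $\lambda \neq 0$ and $\nu \neq 0$. For
$i \in \bar{Z}$, Equation (10) implies that $q_i = \lambda \frac{p_i}{\nu - \mu_i - V_i}$. Since $\lambda \neq 0$, $q_i > 0$ and then,
according to (13), $\mu_i = 0$. Therefore,

$$q_i = \lambda \frac{p_i}{\nu - V_i}. \tag{14}$$

Let $r = \sum_{i \in Z} q_i$, where $Z = \{i : p_i = 0\}$. Summing on $i \in \bar{Z}$ and using equations
(14) and (12), we have

$$\lambda \sum_{i \in \bar{Z}} \frac{p_i}{\nu - V_i} = \sum_{i \in \bar{Z}} q_i = 1 - r. \tag{15}$$

Using (14) and (15), we can write $\sum_{i \in \bar{Z}} p_i \log \frac{p_i}{q_i} = f(\nu) - \log(1 - r)$ where $f$ is
defined in (4). Then, $q$ satisfies condition (11) if and only if $f(\nu) = \epsilon + \log(1 - r)$.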

Consider now the case where $i \in Z$. Let $I^* = Z \cap \arg\max_i V_i$. Note that,
for all $i \in Z \setminus I^*$, $q_i = 0$. Indeed, otherwise, $\mu_i$ should be zero, and then
$\nu = V_i$ according to (10), which involves a possible negative denominator in (14).
According to (13), for all $i \in I^*$, either $q_i = 0$ or $\mu_i = 0$. The second case implies
that $\nu = V_i$ and $r > 0$, which requires that $f(\nu) < \epsilon$ so that (11) can be satisfied
with $r > 0$. Therefore,

• if $f(V_i) < \epsilon$ for $i \in I^*$, then $\nu = V_i$ and the constant $r$ can be computed
solving equation $f(\nu) = \epsilon + \log(1 - r)$; the values of $q_i$ for $i \in I^*$ may be
chosen in any way such that $\sum_{i \in I^*} q_i = r$;

• if for all $i \in I^*$, $f(V_i) \ge \epsilon$, then $r = 0$, $q_i = 0$ for all $i \in Z$ and $\nu$ is the
solution of the equation $f(\nu) = \epsilon$.

Once $\nu$ and $r$ have been determined, the other components of $q$ can be computed
according to (14): we have that for $i \in \bar{Z}$, $q_i = \frac{(1 - r)\, \tilde{q}_i}{\sum_{k \in \bar{Z}} \tilde{q}_k}$ where $\tilde{q}_i = \frac{p_i}{\nu - V_i}$.



B Properties of the f function

In this section, a few properties of function $f$ defined in Equation (4) are stated,
as this function plays a key role in the maximizing procedure of Section 3.2.

Proposition 2 $f$ is a convex, decreasing mapping from $]\max_{i \in \bar{Z}} V_i; \infty[$ onto
$]0; \infty[$.

Proof Using Jensen's inequality, it is easily shown that the $f$ function decreases
from $+\infty$ to 0. The second derivative of $f$ with respect to $\nu$ is equal to

$$f''(\nu) = -\sum_i \frac{p_i}{(\nu - V_i)^2}
+ \frac{2 \sum_i \frac{p_i}{(\nu - V_i)^3} \sum_i \frac{p_i}{\nu - V_i} - \left( \sum_i \frac{p_i}{(\nu - V_i)^2} \right)^2}{\left( \sum_i \frac{p_i}{\nu - V_i} \right)^2}.$$

If $Z$ denotes a positive random value such that $P\left( Z = \frac{1}{\nu - V_i} \right) = p_i$, then

$$f''(\nu) = \frac{2 E(Z^3) E(Z) - E(Z^2) E(Z)^2 - E(Z^2)^2}{E(Z)^2}.$$

Using the Cauchy-Schwarz inequality, we have $E(Z^2)^2 = E(Z^{3/2} Z^{1/2})^2 \le E(Z^3) E(Z)$.
In addition $E(Z^2)^2 \ge E(Z^2) E(Z)^2$. These two inequalities show that $f''(\nu) \ge 0$.

As mentioned in Section 3.2, Newton's method can be applied to solve the
equation $f(\nu) = \epsilon$ for a fixed value of $\epsilon$. When $\epsilon$ is close to 0, the solution of this
equation is quite large and an appropriate initialization accelerates convergence.
Using a second-order Taylor series approximation of the function $f$, it can be
seen that, for $\nu$ near $\infty$, $f(\nu) = \frac{\sigma_{p,V}}{2\nu^2} + o\left( \frac{1}{\nu^2} \right)$, where $\sigma_{p,V} = \sum_i p_i V_i^2 - \left( \sum_i p_i V_i \right)^2$.
The Newton iterations can thus be initialized by taking $\nu_0 = \sqrt{\sigma_{p,V} / (2\epsilon)}$.
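The asymptotic approximation and the resulting starting point can be checked numerically with the short sketch below. The function names `f_kl`, `variance` and `newton_start` are hypothetical and chosen for this example; the sketch only evaluates the formulas stated in this appendix.

```python
import math


def f_kl(nu, V, p):
    """The function f of the text, evaluated over the states with p_i > 0."""
    support = [i for i in range(len(p)) if p[i] > 0]
    s_log = sum(p[i] * math.log(nu - V[i]) for i in support)
    s_inv = sum(p[i] / (nu - V[i]) for i in support)
    return s_log + math.log(s_inv)


def variance(V, p):
    """Variance of V under p: sum_i p_i V_i^2 - (sum_i p_i V_i)^2."""
    mean = sum(p_i * v_i for p_i, v_i in zip(p, V))
    return sum(p_i * v_i * v_i for p_i, v_i in zip(p, V)) - mean * mean


def newton_start(V, p, eps):
    """Starting point suggested by the large-nu approximation of f."""
    return math.sqrt(variance(V, p) / (2.0 * eps))
```

For $p = (0.3, 0.7, 0)'$ and $V = (1, 2, 3)'$ the variance is 0.21, so that at $\nu = 1000$ the approximation gives about $1.05 \times 10^{-7}$, within one percent of the exact value of $f$.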